💼 Career
|- EY - Manager, Valuation, Modeling, & Economics
|- PG&E - Supervisor, Capital Recovery & Analysis
|- KPMG - Sr Manager, Economics & Valuation
|- Centene - Data Scientist III, Strategic Insights
|- Bloomreach - Sr Manager, Data Ops & Analytics
|- Centene - Lead Machine Learning Engineer
📚 Education
|- Georgia Tech - BS Management
|- UC Irvine - MS Business Analytics
Personal Site: JavierOrracaDeatcu.com | LinkedIn: linkedin.com/in/orraca
💼 Career
|- EY - Microsoft Excel, Microsoft Access, Financial Modeling
|- PG&E - SQL, ODBC (connecting Access / Excel to EDWs)
|- KPMG - VBA scripting, Excel add-ins (Power Query and Power BI)
|- Centene - R, Web Apps, Package Dev, ML / AI, Cloud DS Tools
|- Bloomreach - GCP, Amazon Redshift, Linux, Google Workspaces
|- Centene - Docker, k8s, Databricks, Linux, bash, GenAI + LLMs
📚 Education
|- Georgia Tech - Finance, Business Management
|- UC Irvine - Business Analytics, Data Science
Image credit: https://vas3k.com/blog/machine_learning
How should I measure my baseline?
Which ML algorithm(s) should I use?
How should I split my training and testing data?
How should I evaluate my model fit?
Which performance evaluation metric(s) should I use?
Every ML package has a unique set of functions and arguments
and the most annoying issue…
data
sucks!
Learn more about Tidymodels at tidymodels.org
Image credit: Chen Xing’s Tidymodels Ecosystem Tutorial
Our tidymodels workflow will follow these steps:
Prediction problem: Predict survival of Titanic passengers
# Load necessary packages
library(tidyverse)
library(tidymodels)
library(titanic)
# Load and prepare data
data(titanic_train)
titanic_data <- as_tibble(titanic_train) |>
mutate(Survived = factor(Survived, levels = c(0, 1))) # Convert the target variable to a factor
# Take a look at the data
glimpse(titanic_data)
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Split the data (create a singular object containing training and testing splits)
set.seed(123)
titanic_split <- initial_split(
data = titanic_data,
prop = 0.75,
strata = Survived # stratify split by Survived column
)
# Create training and testing datasets
train_data <- training(titanic_split)
test_data <- testing(titanic_split)
glimpse() at the splits
Rows: 667
Columns: 12
$ PassengerId <int> 6, 7, 8, 13, 14, 15, 19, 21, 25, 27, 28, 31, 36, 38, 41, 4…
$ Survived <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Pclass <int> 3, 1, 3, 3, 3, 3, 3, 2, 3, 3, 1, 1, 1, 3, 3, 2, 3, 3, 3, 3…
$ Name <chr> "Moran, Mr. James", "McCarthy, Mr. Timothy J", "Palsson, M…
$ Sex <chr> "male", "male", "male", "male", "male", "female", "female"…
$ Age <dbl> NA, 54, 2, 20, 39, 14, 31, 35, 8, NA, 19, 40, 42, 21, 40, …
$ SibSp <int> 0, 0, 3, 0, 1, 0, 1, 0, 3, 0, 3, 0, 1, 0, 1, 1, 0, 0, 1, 2…
$ Parch <int> 0, 0, 1, 0, 5, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Ticket <chr> "330877", "17463", "349909", "A/5. 2151", "347082", "35040…
$ Fare <dbl> 8.4583, 51.8625, 21.0750, 8.0500, 31.2750, 7.8542, 18.0000…
$ Cabin <chr> "", "E46", "", "", "", "", "", "", "", "", "C23 C25 C27", …
$ Embarked <chr> "Q", "S", "S", "S", "S", "S", "S", "S", "S", "C", "S", "C"…
# Review the proportion of 0s and 1s in each split
train_data |>
summarise(Train_Rows = n(), .by = Survived) |>
mutate(Train_Percent = Train_Rows / sum(Train_Rows)) |>
left_join(test_data |>
summarise(Test_Rows = n(), .by = Survived) |>
mutate(Test_Percent = Test_Rows / sum(Test_Rows)),
join_by(Survived))
# A tibble: 2 × 5
Survived Train_Rows Train_Percent Test_Rows Test_Percent
<fct> <int> <dbl> <int> <dbl>
1 0 411 0.616 138 0.616
2 1 256 0.384 86 0.384
# Create a pre-processing recipe
titanic_recipe <- recipe(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
data = train_data) |>
step_impute_median(Age) |> # Handle missing values in Age
step_dummy(all_nominal_predictors()) |> # Convert categorical variables to dummy variables
step_normalize(all_numeric_predictors()) # Normalize numeric predictors
titanic_recipe
With {parsnip}, you have a unified interface for ML:
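The workflow below adds a model object named `log_model` whose definition did not survive extraction; based on the printed workflow (logistic regression classification, `glm` engine), it was presumably specified along these lines:

```r
# Specify a logistic regression model via the unified {parsnip} interface
log_model <- logistic_reg() |>
  set_engine("glm") |>           # use base R's glm() under the hood
  set_mode("classification")     # predict a factor outcome (Survived)
```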
# Create a workflow
titanic_workflow <- workflow() |>
add_recipe(titanic_recipe) |>
add_model(log_model)
titanic_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_median()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
# Fit the workflow to the training data
titanic_fit <- last_fit(titanic_workflow, titanic_split)
titanic_fit
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [667/224]> train/test split <tibble> <tibble> <tibble> <workflow>
Let’s try modeling with {glmnet} and hyperparameter tuning
# Create cross-validation folds
set.seed(234)
titanic_folds <- vfold_cv(train_data, v = 5, strata = Survived)
titanic_folds
# 5-fold cross-validation using stratification
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [532/135]> Fold1
2 <split [534/133]> Fold2
3 <split [534/133]> Fold3
4 <split [534/133]> Fold4
5 <split [534/133]> Fold5
# Create a grid of hyperparameters to try
log_grid <- grid_regular(
penalty(range = c(-3, 0), trans = log10_trans()),
mixture(range = c(0, 1)),
levels = c(4, 5)
)
log_grid
# A tibble: 20 × 2
penalty mixture
<dbl> <dbl>
1 0.001 0
2 0.01 0
3 0.1 0
4 1 0
5 0.001 0.25
6 0.01 0.25
7 0.1 0.25
8 1 0.25
9 0.001 0.5
10 0.01 0.5
11 0.1 0.5
12 1 0.5
13 0.001 0.75
14 0.01 0.75
15 0.1 0.75
16 1 0.75
17 0.001 1
18 0.01 1
19 0.1 1
20 1 1
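The `tune_grid()` call that follows references a `tune_workflow` object whose definition is not shown above; presumably it pairs the existing recipe with a tunable elastic-net {glmnet} specification, roughly:

```r
# Elastic-net logistic regression with tunable penalty and mixture
tune_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  set_mode("classification")

# Reuse the same pre-processing recipe with the tunable model
tune_workflow <- workflow() |>
  add_recipe(titanic_recipe) |>
  add_model(tune_spec)
```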
# Tune the model
set.seed(345)
log_tuning_results <- tune_grid(
tune_workflow,
resamples = titanic_folds,
grid = log_grid,
metrics = metric_set(accuracy, roc_auc)
)
log_tuning_results
# Tuning results
# 5-fold cross-validation using stratification
# A tibble: 5 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [532/135]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
2 <split [534/133]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
3 <split [534/133]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
4 <split [534/133]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>
5 <split [534/133]> Fold5 <tibble [40 × 6]> <tibble [0 × 3]>
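The five-row summary that follows ranks the best parameter combinations by ROC AUC; it was likely produced with `show_best()`:

```r
# Show the top 5 hyperparameter combinations, ranked by ROC AUC
show_best(log_tuning_results, metric = "roc_auc", n = 5)
```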
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.1 0 roc_auc binary 0.843 5 0.0126 Preprocessor1_Model03
2 0.001 0 roc_auc binary 0.843 5 0.0121 Preprocessor1_Model01
3 0.01 0 roc_auc binary 0.843 5 0.0121 Preprocessor1_Model02
4 0.01 0.25 roc_auc binary 0.842 5 0.0114 Preprocessor1_Model06
5 0.01 0.75 roc_auc binary 0.842 5 0.0107 Preprocessor1_Model14
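The final fit below references a `final_workflow` object not shown above; presumably the best penalty/mixture pair was selected from the tuning results and plugged back into the workflow:

```r
# Select the best hyperparameters and finalize the tunable workflow
best_params <- select_best(log_tuning_results, metric = "roc_auc")
final_workflow <- finalize_workflow(tune_workflow, best_params)
```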
# Fit the final model to the entire training set and evaluate on test set
final_fit <- final_workflow |>
last_fit(titanic_split)
# Get the metrics
collect_metrics(final_fit)
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.790 Preprocessor1_Model1
2 roc_auc binary 0.872 Preprocessor1_Model1
3 brier_class binary 0.146 Preprocessor1_Model1
Thank you! 🤍
Connect with me!